{
 "metadata": {
  "name": ""
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Chapter 3: Text Processing"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "-- *A Python Course for the Humanities by Folgert Karsdorp and Maarten van Gompel*"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In this chapter we will write several functions for text processing. "
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Sentence Splitting"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Write a function `split_sentences` that performs a very simple sentence splitting when passed a list of tokens. Each sentence should be represented as a list of tokens, so the function as a whole returns a list of lists of tokens.\n",
      "\n",
      "A tuple of end-of-sentence markers is given. You may assume that any occurrence of an end-of-sentence marker marks the end of a sentence. In reality, this is more ambiguous of course, consider for example the use of periods as end-of-sentence marker as well as in abbreviations and initials!"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "end_of_sentence_markers = ('.', '!', '?')\n",
      "\n",
      "def split_sentences(tokens):\n",
      "    # write your code here\n",
      "\n",
      "# these tests should return True if your code is correct\n",
      "print(split_sentences( ['This','is','one','sentence','.','This','is','the','second','.']) == \n",
      " [ ['This','is','one','sentence','.'],['This','is','the','second','.'] ])\n",
      "print(split_sentences( ['Hello', 'world','!','I','know','Python','!']) == \n",
      "[ ['Hello','world','!'],['I','know','Python','!'] ])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Bonus exercise"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Adapt the sentence splitter to also make it work when end-of-sentence markers are found in repetition, such as in ellipsis (...). A sentence should only be split at the last of such markers."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "---"
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Simple Tokenisation"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Tokenisation is a very common pre-processing stage for many NLP tasks. We are going to write a simple tokeniser which detaches any punctuation attached to words and produces a list of tokens. Write a function `tokenise(text)` which takes a string as parameter and returns a list of tokens. Punctuation characters should become separate tokens, whitespace and newlines never become tokens."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "PUNCTUATION = (\".\",\",\",\":\",\";\",\"\\\"\",\"'\",\"!\",\"?\",\"(\",\")\",\"[\",\"]\",\"/\")\n",
      "WHITESPACE = (\" \", \"\\t\", \"\\n\", \"\\r\")\n",
      "\n",
      "def tokenise(text):\n",
      "    # write your code here\n",
      "\n",
      "# The following tests should all return True if your code is correct \n",
      "print(tokenise(\"This is a test\") == [\"This\",\"is\",\"a\",\"test\"])\n",
      "print(tokenise(\"To be or not to be, that is the question!\") == [\"To\",\"be\",\"or\",\"not\",\"to\",\"be\",\",\",\"that\",\"is\",\"the\",\"question\",\"!\"])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can now combine this with the sentence splitter we wrote above: "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "quote = \"To be or not to be, that is the question; Whether 'tis nobler in the mind to suffer the slings and arrows of outrageous fortune, or to take arms against a sea of troubles, and by opposing, end them.\"\n",
      "sentences = split_sentences(tokenise(quote))\n",
      "print(sentences)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "---"
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Extracting n-grams *"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "N-grams are very common in NLP applications. An *n*-gram is a token sequence of length *n*. We are going to write a function `get_ngrams(tokens, n)`, where the first parameter is a longer sequence of tokens (such as a sentence), to extract from. The last argument specifies the size of the *n*-gram we want to extract. The goal is to extract all *n*-grams present in the list of tokens. Each *n*-gram should be either a list or tuple of length *n*."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def get_ngrams(tokens, n):\n",
      "    # insert your code here\n",
      "    \n",
      "quote = ['To','be','or','not','to','be']\n",
      "print(get_ngrams(quote,3) ==  [['To','be','or'],['be','or','not'],['or','not','to'],['not','to','be']])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "---"
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Computing a Frequency Distribution"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In this exercise we are going to create a frequency distrbution given a set of sentences, again in which each sentence consists of a list of tokens. We are thus going to find how many times each type of word occurs in the text."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "What data structure would be most suitable for a frequency distribution? A dictionary! They keys will be words, and the value will be the number of times the word occurs."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "For this purpose however, we are going to introduce a new kind of dictionary rather than the built-in one. Recall the following:\n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "freq_dist = {'to': 2, 'be': 2, 'or': 1, 'not': 1}\n",
      "print(freq_dist['to'])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print(freq_dist['was'])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Specifying a non-existent key on a dictionary will give a KeyError. We are going to use a special dictionary called a `defaultdict` which circumvents this problem and returns a default value when a key does not exist. In this situation we would like a default value of zero (an integer). Create a `defaultdict` for integers as follows:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from collections import defaultdict\n",
      "freq_dist = defaultdict(int)\n",
      "print(freq_dist['was'])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Except for this behaviour, the defaultdict behaves just like a normal dictionary. Now write a function `make_frequency_distribution(sentences)` that creates a frequency distribution for all passed sentences (each consisting of a list of tokens). In computing the frequency distribution, use lower-case only, thus removing any case-sensitivity."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from collections import defaultdict\n",
      "\n",
      "def make_frequency_distribution(sentences):\n",
      "    freq_dist = defaultdict(int)\n",
      "    # write your code here\n",
      "    return freq_dist\n",
      "\n",
      "sentences = [['To','be','or','not','to','be'],['that','is','the','question']]\n",
      "freq_dist = make_frequency_distribution(sentences)\n",
      "\n",
      "# These tests should all return True if your implementation is correction\n",
      "print(freq_dist['be'] == 2)\n",
      "print(freq_dist['or'] == 1)\n",
      "print(freq_dist['to'] == 2) # only works if not case-sensitive"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Bonus exercise"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Rather than creating a frequency list for words, we can also create a frequency list for *n*-grams. Write a new version of `make_frequency_distribution2(sentences, n)` which makes a frequency distribution using *n*-grams of the specified size. Reuse the `get_ngrams()` function you wrote earlier. Make sure to use tuples instead of lists as dictionary key; only immutables like tuples can be dictionary keys, so lists won't work. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from collections import defaultdict\n",
      "\n",
      "def make_frequency_distribution2(sentences, n):\n",
      "    freq_dist = defaultdict(int)\n",
      "    # write your code here\n",
      "    return freq_dist\n",
      "\n",
      "sentences = [['To','be','or','not','to','be'],['that','is','the','question']]\n",
      "freq_dist = make_frequency_distribution2(sentences, 2)\n",
      "\n",
      "# These tests should all return True if your implementation is correction\n",
      "print(freq_dist[('to','be')] == 2)\n",
      "print(freq_dist[('or','not')] == 1)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "---"
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Finding files and reading a corpus"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Remember how to read files? Write a function `read_corpus_file(filepath)` that reads the specified file and simply returns all contents as a single string. Use the utf-8 encoding."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def read_corpus_file(filepath):\n",
      "    # write your code here"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now, we have a corpus consisting of multiple *.txt files in the `data/` directory. We want to iterate over all these files. You can do this with the `glob` function from the `glob` module. Consider the following function:\n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from glob import glob\n",
      "\n",
      "def find_corpus_files(corpus_directory):\n",
      "    return glob(corpus_directory + \"/*.txt\") "
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now if you correctly implemented `read_corpus_file`, the following code will read all text files and outputs the length (in characters) of each:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "for filepath in find_corpus_files(\"data/\"):\n",
      "    text = read_corpus_file(filepath)\n",
      "    print(filepath + \" has \" + str(len(text)) + \" characters\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Generators"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now we would like to introduce a special and very useful kind of function in Python: a *generator*. A normal function tends to **return** a value, which marks the end of its execution. A generator returns a value and **yields**, giving the value back to the caller, but on the next iteration it continues where it left of! Consider the following examples with the same functionality, both receive a list of numbers and return the double of each element:\n",
      "    "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def double_numbers(numbers):\n",
      "    new_numbers = []\n",
      "    for number in numbers:\n",
      "        new_numbers.append(number * 2)\n",
      "    return new_numbers\n",
      "\n",
      "for number in double_numbers([1, 2, 3, 4, 5]):\n",
      "    print(number)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "And now as a generator:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def double_numbers2(numbers):\n",
      "    for number in numbers:\n",
      "        yield number * 2\n",
      "        \n",
      "for number in double_numbers2([1, 2, 3, 4, 5]):\n",
      "    print(number)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The advantage of a generator is not just its elegance, but also the fact that a generator generally conserves memory. In the example above, it does not have to build a temporary list in memory. Many of the functions we have seen, including `enumerate()`, `zip()`, `range()`, `glob()` etc... are actually generators!  "
     ]
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Bonus Quiz"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "There is an even more elegant method to quickly manipulate a list, on one-line. Do you remember which and how it works?"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# try here"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "---"
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Tying things together"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In this chapter we have created functions for text preprocessing. We are going to bring all of these together and build a module `preprocess.py` (Python code always has the extension .py) which we can then conveniently use in later scripts."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "For this we are going to leave this nice browser environment, as in practise you will not be using this either. We will be using a capable text editor capable to write Python scripts. Though any text editor will do, it is nice if we have an editor that knows about Python and which can colour our code so it is more pleasant to read. We will be using the editor Sublime Text 2, available for all platforms. But if you have another favourite editor already, feel free to use it. "
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Note the difference between a text editor and a word processor such as MS Word or OpenOffice. A text processor edits plain-text files, a word processor includes all kind of markup and is **not** a plain text editor.\n"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "So open your text editor, and start by copying the functions `split_sentences`, `tokenise`, `get_ngrams`, `make_frequency_distribution`, `read_corpus_file`, and `find_corpus_files`... Don't forget any imports you might need."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Then add a **generator** function `readcorpus`, which iterates over all files in the corpus and on each iteration yields a 2-tuple `(filename, sentences)` . That is, the filename of the corpus file found, and the sentences (split and tokenised using the functions we wrote earlier)."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Once we have a script, we can run it by executing: python3 *preprocess.py*     on the command line. Since our script has only functions and nothing to invoke it, it will not output anything, but it should not give any errors either.\n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# do not write your code here ;-) write it in an editor!"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "---"
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Passing parameters from the command-line *"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "One commonly used method of passing data to a program is by specifying it as parameters on the command line when calling your program. Your Python program can then use `sys.argv` to have access to these parameters. `sys.argv` is a list containing all parameters passed to the program, the first element is always the program name itself so the actual parameters start at index 1. Look at the following program. It will make use of the preprocessing module we just wrote! It expects two parameters from the command line, a filename and a plain text file and a value for n. It will then extract and output all n-grams of size n from the text file. \n",
      "\n",
      "You don't need to write any code here, but do see if you understand all that happens. Just copy this code in your editor and save it as `extractngrams.py`. Then run it from the command line and pass it a text and a value for n: *python3 extractngrams.py filename n*."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import preproces\n",
      "import sys\n",
      "\n",
      "if len(sys.argv) != 3:\n",
      "    print(\"Not enough arguments specified! Specify two parameters on the command line:\")\n",
      "    print(\"Syntax: python3 extractngrams.py filename n\")\n",
      "    sys.exit(1) # this quits our program\n",
      "    \n",
      "filename = sys.argv[1]\n",
      "try:\n",
      "    n = int(sys.argv[2])\n",
      "except ValueError:\n",
      "    print(\"N must be an integer!\")\n",
      "    sys.exit(1)\n",
      "\n",
      "text = preprocess.read_corpus_file(filename)\n",
      "sentences = preprocess.split_sentences(preprocess.tokenise(text))\n",
      "for sentence in sentences:\n",
      "    for ngram in preprocess.get_ngrams(sentence, n)\n",
      "        print(\" \".join(ngram))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "---"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Ignore the following, it's just here to make the page pretty:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from IPython.core.display import HTML\n",
      "def css_styling():\n",
      "    styles = open(\"styles/custom.css\", \"r\").read()\n",
      "    return HTML(styles)\n",
      "css_styling()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<style>\n",
        "    @font-face {\n",
        "        font-family: \"Computer Modern\";\n",
        "        src: url('http://mirrors.ctan.org/fonts/cm-unicode/fonts/otf/cmunss.otf');\n",
        "    }\n",
        "    div.cell {\n",
        "        width:800px;\n",
        "        margin-left:auto;\n",
        "        margin-right:auto;\n",
        "    }\n",
        "    div.cell, .input.hbox {\n",
        "    display:block;\n",
        "}\n",
        "    h1 {\n",
        "        font-family: \"Charis SIL\", Palatino, serif;\n",
        "    }\n",
        "    h4{\n",
        "        margin-top:12px;\n",
        "        margin-bottom: 3px;\n",
        "       }\n",
        "    div.text_cell_render{\n",
        "        font-family: Computer Modern, \"Helvetica Neue\", Arial, Helvetica, Geneva, sans-serif;\n",
        "        line-height: 145%;\n",
        "        font-size: 120%;\n",
        "        width:800px;\n",
        "        margin-left:auto;\n",
        "        margin-right:auto;\n",
        "    }\n",
        "    .CodeMirror{\n",
        "            font-family: \"Source Code Pro\", source-code-pro,Consolas, monospace;\n",
        "    }\n",
        "    .prompt{\n",
        "        display: None;\n",
        "    }\n",
        "    .text_cell_render h5 {\n",
        "        font-weight: 300;\n",
        "        font-size: 16pt;\n",
        "        color: #4057A1;\n",
        "        font-style: italic;\n",
        "        margin-bottom: .5em;\n",
        "        margin-top: 0.5em;\n",
        "        display: block;\n",
        "    }\n",
        "    \n",
        "    .warning{\n",
        "        color: rgb( 240, 20, 20 )\n",
        "        }\n",
        "</style>\n",
        "<script>\n",
        "    MathJax.Hub.Config({\n",
        "                        TeX: {\n",
        "                           extensions: [\"AMSmath.js\"]\n",
        "                           },\n",
        "                tex2jax: {\n",
        "                    inlineMath: [ ['$','$'], [\"\\\\(\",\"\\\\)\"] ],\n",
        "                    displayMath: [ ['$$','$$'], [\"\\\\[\",\"\\\\]\"] ]\n",
        "                },\n",
        "                displayAlign: 'center', // Change this to 'center' to center equations.\n",
        "                \"HTML-CSS\": {\n",
        "                    styles: {'.MathJax_Display': {\"margin\": 4}}\n",
        "                }\n",
        "        });\n",
        "</script>"
       ],
       "output_type": "pyout",
       "prompt_number": 1,
       "text": [
        "<IPython.core.display.HTML at 0x1044283d0>"
       ]
      }
     ],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "by Folgert Karsdorp (Meertens Instituut) en Maarten van Gompel (Radboud University Nijmegen)\n",
      "\n",
      "Licensed under the [GNU Free Documentation License](http://www.gnu.org/copyleft/fdl.html)"
     ]
    }
   ],
   "metadata": {}
  }
 ]
}